fix: proxy-webhook selector matches operator pods#3228
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
Fixes a production routing bug where the tekton-operator-proxy-webhook Service selector unintentionally matched both proxy-webhook and main operator pods, causing intermittent admission webhook failures and rejected TaskRun Pod creation.
Changes:
- Update the proxy-webhook Deployment selector + pod template label to use
name: tekton-operator-proxy-webhook. - Update the proxy-webhook Service selector to match the new pod label (Kubernetes + OpenShift manifests).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cmd/kubernetes/operator/kodata/webhook/webhook.yaml | Aligns proxy-webhook Deployment/Service selectors to target only proxy-webhook pods on Kubernetes. |
| cmd/openshift/operator/kodata/webhook/webhook.yaml | Same selector/label fix for the OpenShift manifest. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@bowling233 , thank for your PR.
|
|
@bowling233 - Request you to address the review comment, if you are still pursuing this PR. Thank you. 🙇♀️ |
|
So sorry for the late response! I won't have the bandwidth to properly validate these changes until next month. I can confirm this approach has side effects—specifically, the HorizontalPodAutoscaler is hitting This PR definitely needs more refinement to handle the selector immutability and the HPA configuration. Should I move this to a Draft for now, or would you prefer I close this and resubmit once I've validated a full fix? |
|
@bowling233 - would it be possible to progress this PR? |
Both the `tekton-operator` and `tekton-operator-proxy-webhook` Deployments label their Pods with `name: tekton-operator`. The `tekton-operator-proxy-webhook` Service uses this same label as its only selector, so it inadvertently load-balances traffic across both Deployments. Because `tekton-operator` pods do not serve on port 8443, ~50% of admission webhook requests fail: failed calling webhook "proxy.operator.tekton.dev": Post ".../tekton-operator-proxy-webhook.../defaulting": dial tcp <ClusterIP>:443: connect: connection refused Because MutatingWebhookConfiguration has `failurePolicy: Fail`, each such failure immediately rejects TaskRun Pod creation. Rename the proxy-webhook Deployment's selector matchLabels and pod template label from `name: tekton-operator` to `name: tekton-operator-proxy-webhook`, and update the Service selector to match. The `app: tekton-operator` label is left unchanged. Applies to both Kubernetes and OpenShift manifests. Adding a set-based (NotIn) expression to the Service selector instead was not viable as Kubernetes Services only support equality-based (matchLabels) selectors.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@jkhelil Thanks for the review. I checked the upgrade path and also clarified why this issue happens in our environment. The root cause is that our deployment uses a single namespace for both the operator and its operands. We install the release manifest through Kustomize with Before this fix, the proxy webhook Service selected only While testing upgrade from So I added upgrade handling in the TektonInstallerSet reconciler: when an existing Deployment selector differs from the desired selector, the reconciler deletes the old Deployment and lets the next reconcile recreate it with the new selector. Validation detailsI verified the failure mode by simulating the old selector behavior in minikube:
After applying the fix:
|
Delete and recreate managed Deployments when their selector changes, because Kubernetes treats spec.selector as immutable. This keeps upgrades working for proxy-webhook selector migrations. Co-authored-by: OpenAI Codex <codex@openai.com>
Changes
Fixes #3227
Both the
tekton-operatorandtekton-operator-proxy-webhookDeploymentslabel their Pods with
name: tekton-operator. Thetekton-operator-proxy-webhookService uses this same label as its onlyselector, so it inadvertently load-balances traffic across both Deployments.
Because
tekton-operatorpods do not serve on port 8443, ~50% of admissionwebhook requests fail with
connection refused. Since theMutatingWebhookConfiguration has
failurePolicy: Fail, each failureimmediately rejects TaskRun Pod creation.
Changes:
cmd/kubernetes/operator/kodata/webhook/webhook.yaml: rename theproxy-webhook Deployment's
matchLabelsselector and pod template labelfrom
name: tekton-operatortoname: tekton-operator-proxy-webhook;update the Service
selectorto match.cmd/openshift/operator/kodata/webhook/webhook.yaml: same change for theOpenShift manifest.
The existing
app: tekton-operatorlabel is preserved on both Deployments.No other resources are affected.
Alternative considered: adding a set-based (
NotIn) expression to theService selector to exclude
tekton-operatorpods. This was not viablebecause Kubernetes Services only support equality-based (
matchLabels)selectors.
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you
review them:
make test lintbefore submitting a PRSee the contribution guide for more details.
Release Notes